Reading: <a href="http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf">Generative and Discriminative Classifiers</a> by Tom Mitchell.
| $O$ Is older than 35 years | $I$ Has Personal Income | $S$ Is a Student | $J$ Birthday before July 1st | $Y$ Buys computer |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 1 | 1 |
| 1 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 | 0 |
| 0 | 0 | 1 | 0 | 1 |
| 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 | 1 |
| 0 | 1 | 1 | 0 | 1 |
| 0 | 0 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 1 |
| 0 | 1 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 |
What should we predict for the next customer? We want to find $P(Y=1 \mid O=0, I=1, S=1, J=1)$.
| $O$ Is older than 35 years | $I$ Has Personal Income | $S$ Is a Student | $J$ Birthday before July 1st | $Y$ Buys computer |
|---|---|---|---|---|
| 0 | 1 | 1 | 1 | ? |
Suppose $X =(X_1,X_2)$ where $X_i$ and $Y$ are boolean random variables
How many parameters do we need to estimate to know $P(Y| X1, X2)$?
| $$X_1$$ | $$X_2$$ | $$P(Y=1\mid X_1,X_2)$$ | $$P(Y=0\mid X_1,X_2)$$ |
|---|---|---|---|
| 0 | 0 | 0.1 | 0.9 |
| 1 | 0 | 0.24 | 0.76 |
| 0 | 1 | 0.54 | 0.46 |
| 1 | 1 | 0.23 | 0.77 |
4 parameters, one per assignment of $(X_1,X_2)$: each row needs a single number, since $P(Y=0\mid X_1,X_2) = 1-P(Y=1\mid X_1,X_2)$.
Suppose $X =(X_1,X_2, ..., X_d)$ where $X_i$ and $Y$ are boolean random variables
How many parameters do we need to estimate to know $P(Y| X_1, ... X_d)$?
| $$X_1$$ | $$X_2$$ | $$X_3$$ | ... | $$X_d$$ | $$P(Y=1\mid X)$$ | $$P(Y=0\mid X)$$ |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | ... | 0 | 0.1 | 0.9 |
| 0 | 0 | 0 | ... | 1 | 0.24 | 0.76 |
| ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... |
| 1 | 1 | 1 | ... | 1 | 0.52 | 0.48 |
$2^d$ rows! ($2^d$ parameters!).
If we have 30 boolean $X_i$'s: ~ 1 billion rows!
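To make the blow-up concrete, here is a quick sanity check of the table size (a throwaway illustration; the function name is mine):

```python
# One row of the conditional table P(Y | X_1, ..., X_d) per assignment
# of the d boolean features, so 2^d rows (and 2^d free parameters).
def num_rows(d):
    return 2 ** d

print(num_rows(2))   # 4, matching the small table above
print(num_rows(30))  # 1073741824, over a billion rows
```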
We used Bayes' rule last lecture to express marginal and conditional probabilities of the data and parameter $\theta$:

$$ P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)} $$

We can also express the conditional distribution of $Y$ given $X$:

$$ P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)} $$

BTW, this notation is shorthand for:

$$ (\forall i, j) ~ P(Y=y_i|X=x_j) = \frac{P(X=x_j|Y=y_i)P(Y=y_i)}{P(X=x_j)} $$

Equivalently:

$$ (\forall i, j) ~ P(Y=y_i|X=x_j) = \frac{P(X=x_j|Y=y_i)P(Y=y_i)}{\sum_k P(X=x_j|Y=y_k)P(Y=y_k)} $$
| $$Y$$ | $$X_1$$ | $$X_2$$ | $$X_3$$ | ... | $$X_d$$ | $$P(X\mid Y)$$ |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | ... | 0 | 0.1 |
| 0 | 0 | 0 | 0 | ... | 1 | 0.02 |
| 0 | ... | ... | ... | ... | ... | ... |
| 0 | 1 | 1 | 1 | ... | 1 | 0.16 |
| 1 | 0 | 0 | 0 | ... | 0 | 0.03 |
| 1 | 0 | 0 | 0 | ... | 1 | 0.2 |
| 1 | ... | ... | ... | ... | ... | ... |
| 1 | 1 | 1 | 1 | ... | 1 | 0.01 |
Still too many parameters!
If we have 30 boolean $X_i$'s: ~ 2 billion!
Definition: $X$ is conditionally independent of $Y$ given $Z$, if the probability distribution governing $X$ is independent of the value of $Y$, given the value of $Z$.
$$ (\forall i, j, k) ~ P(X=x_i|Y=y_j, Z=z_k) = P(X=x_i|Z=z_k) $$

Naïve Bayes is a classifier that assumes conditional independence of the variables $X_i$ given the label $Y$.
How does this assumption simplify $P(X_1,X_2|Y)$?
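For the two-variable question above, the simplification is a two-step derivation (a standard chain-rule argument, spelled out here for completeness):

```latex
\begin{eqnarray}
P(X_1, X_2 \mid Y) &=& P(X_1 \mid X_2, Y)\, P(X_2 \mid Y) \qquad \text{(chain rule)} \\
                   &=& P(X_1 \mid Y)\, P(X_2 \mid Y) \qquad \text{(conditional independence given } Y\text{)}
\end{eqnarray}
```

With $d$ boolean features this leaves $d$ parameters per class (each $P(X_i=1\mid Y=y_k)$), i.e. $2d$ in total for binary $Y$, instead of $2(2^d-1)$.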
(more on this assumption soon!)
Training:
But... how do we estimate these parameters?
$P(X|Y=y_k)$ has parameters $\theta_{ijk}$, one for each value $x_{ij}$ of each $X_i$. $P(Y)$ has parameters $\pi$.
To follow the MLE principle, we pick the parameters $\pi$ and $\theta$ that maximize the (conditional) likelihood of the data given the parameters.
| $O$ Is older than 35 years | $I$ Has Personal Income | $S$ Is a Student | $J$ Birthday before July 1st | $Y$ Buys computer |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 1 | 1 |
| 1 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 | 0 |
| 0 | 0 | 1 | 0 | 1 |
| 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 | 1 |
| 0 | 1 | 1 | 0 | 1 |
| 0 | 0 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 1 |
| 0 | 1 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 |
We removed one variable ($I$) to simplify the problem and changed the order of the samples.
O = Is older than 35, S = Is a student, J = Birthday before July 1, Y = Buys computer
| $$O$$ | $$S$$ | $$J$$ | $$Y$$ |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 |
| 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 |
| 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 1 |
| 0 | 1 | 1 | 1 |
| 0 | 1 | 0 | 1 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 1 | 1 |
| 0 | 1 | 0 | 1 |
Estimated parameters (relative-frequency counts from the data above):

| | $$Y = 1$$ | $$Y = 0$$ |
|---|---|---|
| $$P(Y)$$ | 9/14 | 5/14 |
| $$P(O=1\mid Y)$$ | 4/9 | 2/5 |
| $$P(O=0\mid Y)$$ | 5/9 | 3/5 |
| $$P(S=1\mid Y)$$ | 6/9 | 1/5 |
| $$P(S=0\mid Y)$$ | 3/9 | 4/5 |
| $$P(J=1\mid Y)$$ | 5/9 | 2/5 |
| $$P(J=0\mid Y)$$ | 4/9 | 3/5 |
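The estimates above are plain relative frequencies; a minimal sketch that recomputes a few of them from the data table (the helper names are mine):

```python
from fractions import Fraction

# Rows of the (O, S, J, Y) table above, in the same order
data = [
    (0,0,0,0), (0,0,1,0), (0,0,0,0), (1,1,1,0), (1,0,0,0),
    (0,1,0,1), (1,0,1,1), (1,0,1,1), (1,1,0,1), (0,1,1,1),
    (0,1,0,1), (0,1,1,1), (1,0,1,1), (0,1,0,1),
]

def p_y(y):
    """MLE estimate of P(Y=y): fraction of rows with that label."""
    return Fraction(sum(1 for r in data if r[3] == y), len(data))

def p_feat(i, v, y):
    """MLE estimate of P(X_i=v | Y=y): count within the class."""
    rows = [r for r in data if r[3] == y]
    return Fraction(sum(1 for r in rows if r[i] == v), len(rows))

print(p_y(1))           # 9/14
print(p_feat(0, 1, 1))  # P(O=1|Y=1) = 4/9
print(p_feat(1, 1, 0))  # P(S=1|Y=0) = 1/5
```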
New data:

| $$O$$ | $$S$$ | $$J$$ | $$Y$$ |
|---|---|---|---|
| 0 | 1 | 1 | ? |
\begin{eqnarray}Y^{\text{new}} &=& \underset{y_k}{\operatorname{argmax}} P(Y=y_k) P(O=0, S = 1, J = 1 | Y=y_k) \\ &=&\underset{y_k}{\operatorname{argmax}} P(Y=y_k) P(O=0| Y=y_k)P(S = 1 | Y=y_k) P( J = 1 | Y=y_k) \end{eqnarray}
$P(Y=1)\,P(O=0\mid Y=1)\,P(S=1\mid Y=1)\,P(J=1\mid Y=1) = \frac{9}{14}\cdot\frac{5}{9}\cdot\frac{6}{9}\cdot\frac{5}{9} \approx 0.132$

$P(Y=0)\,P(O=0\mid Y=0)\,P(S=1\mid Y=0)\,P(J=1\mid Y=0) = \frac{5}{14}\cdot\frac{3}{5}\cdot\frac{1}{5}\cdot\frac{2}{5} \approx 0.017$
Pick label 1!
Not required to make predictions, but we have everything we need:
$$ P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)} $$

$$ P(X) = \sum_k P(X|Y=y_k)P(Y=y_k) $$

$P(Y=1)\,P(O=0\mid Y=1)\,P(S=1\mid Y=1)\,P(J=1\mid Y=1) \approx 0.132$

$P(Y=0)\,P(O=0\mid Y=0)\,P(S=1\mid Y=0)\,P(J=1\mid Y=0) \approx 0.017$

$P(Y=1 \mid O=0, S=1, J=1) = \frac{0.132}{0.132+0.017} \approx 0.885$

$P(Y=0 \mid O=0, S=1, J=1) \approx 0.115$
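The same computation in code (a minimal sketch; the hard-coded fractions are the estimates from the table above):

```python
# Hard-coded MLE estimates from the table, for the query (O=0, S=1, J=1)
score_y1 = (9/14) * (5/9) * (6/9) * (5/9)  # P(Y=1) P(O=0|Y=1) P(S=1|Y=1) P(J=1|Y=1)
score_y0 = (5/14) * (3/5) * (1/5) * (2/5)  # P(Y=0) P(O=0|Y=0) P(S=1|Y=0) P(J=1|Y=0)

prediction = 1 if score_y1 > score_y0 else 0     # argmax over the two labels
posterior_y1 = score_y1 / (score_y1 + score_y0)  # divide by P(X) to normalize

print(round(score_y1, 3), round(score_y0, 3))  # 0.132 0.017
print(prediction, round(posterior_y1, 3))      # 1 0.885
```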
To get an estimate of generalization performance, compute accuracy on held-out set (more about this soon).
Assume you train and use this algorithm to predict "Buy Computer?" with a large dataset, and obtain binary classification accuracy of 75%.
Usually the $X_i$ are not conditionally independent: $$P(X_1,...,X_d|Y) \ne \prod_i P(X_i|Y)$$
Even if the "naïve" conditional independence assumption does not hold in the data, Naïve Bayes can still perform well and is widely used anyway.
To see the effect of the violation of this assumption, consider the extreme case in which $X_i$ is a copy of $X_k$. What is the effect on $P(Y|X)$?
Suppose, for example, that $P(O=1|Y=1) > P(O=1|Y=0)$. How does $P(Y=1|O=1,O^\prime=1,S=1,J=1)$ with the duplicated variable compare to $P(Y=1|O=1,S=1,J=1)$?
What if we have an irrelevant variable?
Assume J is independent of $Y$: $P(J|Y=0) = P(J|Y=1)= P(J)$
Does it hurt classification?
Taking the log of this ratio prevents underflow and expresses the decision rule in a useful form (we will see more reasons later why it is useful). \begin{eqnarray} \ln \frac{P(Y=1|X_1...X_d)}{P(Y=0|X_1...X_d)} = \ln \frac{P(Y=1)}{P(Y=0)} + \sum_i\ln \frac{P(X_i|Y=1)}{P(X_i|Y=0)} ~ ~ ~ ~ ~ > \text{ or } <0? \end{eqnarray}
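A sketch of this log-space rule on the worked example (same parameters as estimated above; a positive log-odds means predict $Y=1$):

```python
import math

# Log prior ratio plus per-feature log likelihood ratios for (O=0, S=1, J=1),
# using the parameters estimated in the worked example
log_odds = math.log((9/14) / (5/14))   # ln P(Y=1)/P(Y=0)
log_odds += math.log((5/9) / (3/5))    # ln P(O=0|Y=1)/P(O=0|Y=0)
log_odds += math.log((6/9) / (1/5))    # ln P(S=1|Y=1)/P(S=1|Y=0)
log_odds += math.log((5/9) / (2/5))    # ln P(J=1|Y=1)/P(J=1|Y=0)

print(log_odds > 0)  # True: predict Y=1, the same decision as before
```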
Since $X_i$s are boolean, we can simplify the notation:
$P(X|Y=y_k)$ has parameters $\theta_{ijk}$, one for each value $x_{ij}$ of each $X_i$.
To follow the MLE principle, we pick the parameters $\theta$ that maximize the conditional likelihood of the data given the parameters.
To estimate:
Example prior for $\pi_k$ where $K$ > 2:
if $K=2$ this becomes a Beta prior.
Example prior for $\theta_{ijk}$ where $J > 2$:
These priors will act as imaginary examples that smooth the estimated distributions and prevent zero values.
To estimate:
How shall we represent text documents for Naïve Bayes?
What are the limitations with this representation?
$X = (X_1, X_2, … X_d)$ = document
$X_i$ is a random variable describing the word at position $i$ in the document
How many parameters do we need to estimate $P(X|Y)$? (say 1,000 words per document, 10,000 words in English)
The conditional independence assumption is very useful here:
Since all $X_i$s have the same distribution, we only have to estimate one parameter per word, per class.
$P(X|Y=y_k)$ is a multinomial distribution:
Probability of observing a document with $\alpha_1$ count of $w_1$, $\alpha_{2}$ count of $w_2$ ... is a multinomial:
$$\frac{|D|!}{\alpha_1!\cdots \alpha_J!} \theta_1^{\alpha_1} \theta_2^{\alpha_2} \theta_3^{\alpha_3} \cdots \theta_J^{\alpha_J}$$

Dirichlet prior examples:
(Dirichlet is the conjugate prior for a multinomial likelihood function)
\begin{eqnarray} \theta_{jk} = \frac{\alpha_{jk} + \beta_{jk}-1}{\sum_m (\alpha_{mk} + \beta_{mk}-1)} \end{eqnarray}

Again, the prior acts like hallucinated examples.
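A minimal sketch of this smoothed (MAP) estimate for one class; the word counts `alpha` and pseudo-counts `beta` are made up for illustration:

```python
# Word counts alpha_j for one class, including a word never seen in training,
# and pseudo-counts beta_j = 2 from a uniform Dirichlet prior (made-up numbers)
alpha = {"ball": 10, "game": 5, "court": 0}
beta = {w: 2 for w in alpha}

# MAP estimate: theta_j = (alpha_j + beta_j - 1) / sum_m (alpha_m + beta_m - 1)
denom = sum(alpha[w] + beta[w] - 1 for w in alpha)
theta = {w: (alpha[w] + beta[w] - 1) / denom for w in alpha}

print(theta["court"] > 0)  # True: smoothing prevents a zero probability
print(sum(theta.values()))
```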
What $\beta$s should we choose?
For code and data, see www.cs.cmu.edu/~tom/mlbook.html click on “Software and Data”.
Can group labels into groups that share priors:
comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
misc.forsale
rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
alt.atheism,
soc.religion.christian,
talk.religion.misc, talk.politics.mideast, talk.politics.misc, talk.politics.guns,
sci.space, sci.crypt, sci.electronics, sci.med
Naïve Bayes: 89% classification accuracy
Even when taking half of the email
More recently, algorithms such as LSTMs and Transformers have become very good at these text classification tasks.
What can we do?
E.g. image classification, where $X_i$ is real-valued.
Classify a person’s cognitive state, based on brain image
60 distinct exemplars, presented 6 times each
Mitchell et al. Science 2008, data available online.
$Y$ is the mental state (reading “house” or “bottle”)
$X_i$ are the voxel activities (voxel = volume pixel).
Naïve Bayes requires $P(X_i | Y=y_k)$ but $X_i$ is continuous:

$$ P(Y=y_k|X_1,...,X_d) = \frac{P(Y=y_k) \prod_i P(X_i|Y=y_k) }{\sum_\ell P(Y=y_\ell) \prod_i P(X_i|Y=y_\ell)} $$

What can we do?
Common approach: assume $P(X_i | Y=y_k)$ follows a Normal (Gaussian) distribution
$$p(X_i = x\mid Y=y_k) = \frac{1}{\sqrt{2\pi\sigma^2_{ik}}} \text{exp}\big({-\frac{1}{2}\frac{(x-\mu_{ik})^2}{\sigma_{ik}^2}}\big)$$

Sometimes we additionally assume the standard deviation does not depend on the class, $\sigma_{ik} = \sigma_i$ (this is explored below).
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import norm
x1 = np.linspace(-10,10,1000)
x2 = np.linspace(-10,10,1000)
# Assume I know the true parameters, this is not the case usually!
mu_1_1 = -5; sigma_1_1 = 2
mu_2_1 = 5; sigma_2_1 = 2
mu_1_0 = 5; sigma_1_0 = 2
mu_2_0 = -5; sigma_2_0 = 2
# Sample data from these distributions
X_positive = norm.rvs(loc=[mu_1_1,mu_2_1], scale=[sigma_1_1, sigma_2_1], size = (100,2))
X_negative = norm.rvs(loc=[mu_1_0,mu_2_0], scale=[sigma_1_0, sigma_2_0], size = (100,2))
plt.figure(figsize=(6,6))
plt.scatter(X_positive[:, 0], X_positive[:, 1],facecolors='r', edgecolors='w')
plt.scatter(X_negative[:, 0], X_negative[:, 1],facecolors='b', edgecolors='w')
plt.axis([-10,10,-10,10],'equal');
P_X1_1 = norm.pdf(x1,mu_1_1,sigma_1_1)
P_X2_1 = norm.pdf(x1,mu_2_1,sigma_2_1)
P_X1_0 = norm.pdf(x1,mu_1_0,sigma_1_0)
P_X2_0 = norm.pdf(x1,mu_2_0,sigma_2_0)
plt.figure(figsize=(6,6))
plt.scatter(X_positive[:, 0], X_positive[:, 1],facecolors='r', edgecolors='w')
plt.scatter(X_negative[:, 0], X_negative[:, 1],facecolors='b', edgecolors='w')
lim_plot = 10
plt.plot(x1,P_X1_1*2*lim_plot-lim_plot,'r',linewidth=4)
plt.text(-7, -12, r'$P(X_1|Y=1)$', color = 'red',fontsize=20)
plt.plot(x1,P_X1_0*2*lim_plot-lim_plot,'b',linewidth=4)
plt.text(3, -12, r'$P(X_1|Y=0)$', color = 'blue',fontsize=20)
plt.axis([-10,10,-10,10],'equal');
plt.figure(figsize=(6,5))
plt.scatter(X_positive[:, 0], X_positive[:, 1],facecolors='r', edgecolors='w')
plt.scatter(X_negative[:, 0], X_negative[:, 1],facecolors='b', edgecolors='w')
lim_plot = 10
plt.plot(x1,P_X1_1*2*lim_plot-lim_plot,'r',linewidth=4)
plt.text(-7, -12, r'$P(X_1|Y=1)$', color = 'red',fontsize=20)
plt.plot(x1,P_X1_0*2*lim_plot-lim_plot,'b',linewidth=4)
plt.text(3, -12, r'$P(X_1|Y=0)$', color = 'blue',fontsize=20)
plt.plot(P_X2_1*2*lim_plot-lim_plot,x1,'r',linewidth=4)
plt.text(-18,5, r'$P(X_2|Y=1)$', color = 'red',fontsize=20)
plt.plot(P_X2_0*2*lim_plot-lim_plot,x1,'b',linewidth=4)
plt.text(-18,-5, r'$P(X_2|Y=0)$', color = 'blue',fontsize=20)
plt.axis([-lim_plot,lim_plot,-lim_plot,lim_plot],'equal')
# Compute log( P(Y=1|X) / P(Y=0|X) )
# as log( P(Y=1)P(X1|Y=1)P(X2|Y=1) / ( P(Y=0)P(X1|Y=0)P(X2|Y=0) ) )
# Using the real parameters. Usually, we have to estimate these!
X1,X2 = np.meshgrid(x1, x2)
def ratio_log(X1,X2):
pY0 =0.5; pY1 = 1- pY0
pY1pXY1 = pY1*norm.pdf(X1,mu_1_1,sigma_1_1)*norm.pdf(X2,mu_2_1,sigma_2_1)
pY0pXY0 = pY0*norm.pdf(X1,mu_1_0,sigma_1_0)*norm.pdf(X2,mu_2_0,sigma_2_0)
return np.log(pY1pXY1/pY0pXY0)
fX = ratio_log(X1,X2)
plt.figure(figsize=(9,7))
# plot contour plot
cs = plt.contourf(X1, X2, fX,20,cmap='RdBu_r',alpha=0.8);
plt.colorbar()
contours = plt.contour(cs, colors='k',alpha=0.4) # this redraws the lines in black
plt.contour(contours,levels=[0],linewidths=5) # this makes the 0 line wider
# previous stuff
plt.scatter(X_positive[:, 0], X_positive[:, 1],facecolors='r', edgecolors='w',s=60)
plt.scatter(X_negative[:, 0], X_negative[:, 1],facecolors='b', edgecolors='w',s=60)
lim_plot = 10
plt.plot(x1,P_X1_1*2*lim_plot-lim_plot,'r',linewidth=4)
plt.text(-7, -12, r'$P(X_1|Y=1)$', color = 'red',fontsize=20)
plt.plot(x1,P_X1_0*2*lim_plot-lim_plot,'b',linewidth=4)
plt.text(3, -12, r'$P(X_1|Y=0)$', color = 'blue',fontsize=20)
plt.plot(P_X2_1*2*lim_plot-lim_plot,x1,'r',linewidth=4)
plt.text(-16,5, r'$P(X_2|Y=1)$', color = 'red',fontsize=20)
plt.plot(P_X2_0*2*lim_plot-lim_plot,x1,'b',linewidth=4)
plt.text(-16,-5, r'$P(X_2|Y=0)$', color = 'blue',fontsize=20)
plt.axis([-lim_plot,lim_plot,-lim_plot,lim_plot],'equal');
def ratio_log_updated(X1,X2,params): # just as an example, not used here
pY0 =0.5; pY1 = 1- pY0
pY1pXY1 = pY1*norm.pdf(X1,params['mu_1_1'],params['sigma_1_1'])*norm.pdf(X2,params['mu_2_1'],params['sigma_2_1'])
pY0pXY0 = pY0*norm.pdf(X1,params['mu_1_0'],params['sigma_1_0'])*norm.pdf(X2,params['mu_2_0'],params['sigma_2_0'])
return np.log(pY1pXY1/pY0pXY0)
def plot_GNB(X_positive,X_negative,params):
pY0 =0.5; pY1 = 1- pY0
P_X1_1 = norm.pdf(x1,params['mu_1_1'],params['sigma_1_1'])
P_X2_1 = norm.pdf(x1,params['mu_2_1'],params['sigma_2_1'])
P_X1_0 = norm.pdf(x1,params['mu_1_0'],params['sigma_1_0'])
P_X2_0 = norm.pdf(x1,params['mu_2_0'],params['sigma_2_0'])
X1,X2 = np.meshgrid(x1, x2)
# faster way to compute the log ratio, or can use fX = ratio_log_updated(X1,X2,params)
fX = np.log(pY1/pY0) + np.log(P_X1_1.reshape([1000,1]).dot(P_X2_1.reshape([1,1000]))/
P_X1_0.reshape([1000,1]).dot(P_X2_0.reshape([1,1000])))
plt.figure(figsize=(10,8))
# plot contour plot
cs = plt.contourf(X1, X2, fX,20,cmap='RdBu_r',alpha=0.8);
plt.colorbar()
contours = plt.contour(cs, colors='k',alpha=0.4)
plt.contour(contours,levels=[0],linewidths=5)
plt.scatter(X_positive[:, 0],X_positive[:, 1],facecolors='r',edgecolors='w',s=60)
plt.scatter(X_negative[:, 0],X_negative[:, 1],facecolors='b',edgecolors='w',s=60)
lim_plot = 10
plt.plot(x1,P_X1_1*2*lim_plot-lim_plot,'r',linewidth=4)
plt.text(-7, -12, r'$P(X_1|Y=1)$', color = 'red',fontsize=20)
plt.plot(x1,P_X1_0*2*lim_plot-lim_plot,'b',linewidth=4)
plt.text(3, -12, r'$P(X_1|Y=0)$', color = 'blue',fontsize=20)
plt.plot(P_X2_1*2*lim_plot-lim_plot,x1,'r',linewidth=4)
plt.text(-16,5, r'$P(X_2|Y=1)$', color = 'red',fontsize=20)
plt.plot(P_X2_0*2*lim_plot-lim_plot,x1,'b',linewidth=4)
plt.text(-16,-5, r'$P(X_2|Y=0)$', color = 'blue',fontsize=20)
plt.axis([-lim_plot,lim_plot,-lim_plot,lim_plot],'equal')
What if the conditional independence assumption is violated, i.e. the $X_i$ are correlated given $Y$?
First: the case where both classes have the same variance.
from scipy.stats import multivariate_normal
# Same param as before
mu_1_1 = -5; sigma_1_1 = 2; mu_2_1 = 5; sigma_2_1 = 2
mu_1_0 = 5; sigma_1_0 = 2; mu_2_0 = -5; sigma_2_0 = 2
cov_positive = np.array([[sigma_1_1**2,3], [3,sigma_2_1**2]] )
cov_negative = np.array([[sigma_1_0**2,3], [3,sigma_2_0**2]] )
print(cov_positive)
# Sample data from these distributions
X_positive = multivariate_normal.rvs(mean=[mu_1_1,mu_2_1],cov=cov_positive,size=(100))
X_negative = multivariate_normal.rvs(mean=[mu_1_0,mu_2_0],cov=cov_negative,size=(100))
plt.figure(figsize=(6,6))
plt.scatter(X_positive[:, 0], X_positive[:, 1],facecolors='r', edgecolors='w')
plt.scatter(X_negative[:, 0], X_negative[:, 1],facecolors='b', edgecolors='w')
plt.axis([-10,10,-10,10],'equal');
[[4 3] [3 4]]
# Assume I perfectly estimate the parameters (not true for limited data!)
params = dict(mu_1_1 = -5, sigma_1_1 = 2,mu_2_1 = 5, sigma_2_1 = 2,
mu_1_0 = 5, sigma_1_0 = 2,mu_2_0 = -5, sigma_2_0 = 2)
plot_GNB(X_positive,X_negative,params)
# Estimate
mu_1_1, mu_2_1 = np.mean(X_positive,axis=0)
mu_1_0, mu_2_0 = np.mean(X_negative,axis=0)
# The true standard deviations are equal; the sample estimates below differ slightly
sigma_1_1, sigma_2_1 = np.std(X_positive,axis=0)
sigma_1_0, sigma_2_0 = np.std(X_negative,axis=0)
print(sigma_1_1, sigma_2_1)
print(sigma_1_0, sigma_2_0)
1.7556505128707445 1.830266323858797
1.8626233472002263 1.9122394301165095
What happens if we force $\sigma_{i0} =\sigma_{i1} $?
# Same param as before
mu_1_1 = -5; sigma_1_1 = 2
mu_2_1 = 5; sigma_2_1 = 2
mu_1_0 = 5; sigma_1_0 = 2
mu_2_0 = -5; sigma_2_0 = 2
cov_positive = np.array([[sigma_1_1**2,3], [3,sigma_2_1**2]] )
cov_negative = np.array([[sigma_1_0**2,3], [3,sigma_2_0**2]] )
# Sample data from these distributions
X_positive = multivariate_normal.rvs(mean=[mu_1_1,mu_2_1], cov=cov_positive, size = (100))
X_negative = multivariate_normal.rvs(mean=[mu_1_0,mu_2_0], cov=cov_negative, size = (100))
params = dict()
# Estimate - Different variance because of limited sample size
params['mu_1_1'], params['mu_2_1'] = np.mean(X_positive,axis=0)
params['sigma_1_1'], params['sigma_2_1'] = np.std(X_positive,axis=0)
params['mu_1_0'], params['mu_2_0'] = np.mean(X_negative,axis=0)
params['sigma_1_0'], params['sigma_2_0'] = np.std(X_negative,axis=0)
plot_GNB(X_positive,X_negative,params)
# Let's set up another example in which the variances are actually different
mu_1_1 = -5; sigma_1_1 = 3
mu_2_1 = 5; sigma_2_1 = 4
mu_1_0 = 5; sigma_1_0 = 2
mu_2_0 = -5; sigma_2_0 = 2
cov_positive = np.array([[sigma_1_1**2,3], [3,sigma_2_1**2]] )
cov_negative = np.array([[sigma_1_0**2,3], [3,sigma_2_0**2]] )
# Sample data from these distributions
X_positive = multivariate_normal.rvs(mean=[mu_1_1,mu_2_1],cov=cov_positive,size=(100))
X_negative = multivariate_normal.rvs(mean=[mu_1_0,mu_2_0],cov=cov_negative,size=(100))
plt.figure(figsize=(6,6))
plt.scatter(X_positive[:, 0], X_positive[:, 1],facecolors='r', edgecolors='w')
plt.scatter(X_negative[:, 0], X_negative[:, 1],facecolors='b', edgecolors='w')
plt.axis([-10,10,-10,10],'equal')
params = dict()
# Estimate - Different variance
params['mu_1_1'], params['mu_2_1'] = np.mean(X_positive,axis=0)
params['sigma_1_1'], params['sigma_2_1'] = np.std(X_positive,axis=0)
params['mu_1_0'], params['mu_2_0'] = np.mean(X_negative,axis=0)
params['sigma_1_0'], params['sigma_2_0'] = np.std(X_negative,axis=0)
plot_GNB(X_positive,X_negative,params)
from sklearn import datasets
plt.figure(figsize=(5,5))
X, y = datasets.make_circles(n_samples=200, factor=.25,noise=.1)
# scale
X_positive = X[y==1]*8
X_negative = X[y==0]*8
plt.figure(figsize=(6,6))
plt.scatter(X_positive[:, 0], X_positive[:, 1],facecolors='r', edgecolors='w')
plt.scatter(X_negative[:, 0], X_negative[:, 1],facecolors='b', edgecolors='w')
plt.axis([-10,10,-10,10],'equal')
params = dict()
# Artificially force same variances
params['mu_1_1'], params['mu_2_1'] = np.mean(X_positive,axis=0)
params['sigma_1_1'], params['sigma_2_1'] = np.std(np.vstack([X_positive,X_negative]),axis=0)
params['mu_1_0'], params['mu_2_0'] = np.mean(X_negative,axis=0)
params['sigma_1_0'], params['sigma_2_0'] = np.std(np.vstack([X_positive,X_negative]),axis=0)
plot_GNB(X_positive,X_negative,params)
params = dict()
# Estimate - Different variance
params['mu_1_1'], params['mu_2_1'] = np.mean(X_positive,axis=0)
params['sigma_1_1'], params['sigma_2_1'] = np.std(X_positive,axis=0)
params['mu_1_0'], params['mu_2_0'] = np.mean(X_negative,axis=0)
params['sigma_1_0'], params['sigma_2_0'] = np.std(X_negative,axis=0)
plot_GNB(X_positive,X_negative,params)
Naïve Bayes classifier